arXiv:2305.02678v1 [cs.GR] 4 May 2023


Real-Time Neural Appearance Models 


TIZIAN ZELTNER*, FABRICE ROUSSELLE*, ANDREA WEIDLICH*, PETRIK CLARBERG*, JAN NOVÁK*,
BENEDIKT BITTERLI*, ALEX EVANS, TOMÁŠ DAVIDOVIČ, SIMON KALLWEIT, and AARON LEFOHN


NVIDIA, Global 


Fig. 1. Close-up renderings of a Teapot asset with our neural BRDF. Our model learns the intricate details and complex multi-layered material behavior of the 
ceramic, fingerprints, smudges, and dust which are responsible for the realism of the object while being faster to evaluate than traditional non-neural models 
of similar complexity. The system we present allows us to include such high-fidelity objects in real-time renderers in a scalable way. 


We present a complete system for real-time rendering of scenes with complex 
appearance previously reserved for offline use. This is achieved with a 
combination of algorithmic and system level innovations. 

Our appearance model utilizes learned hierarchical textures that are 
interpreted using neural decoders, which produce reflectance values and 
importance-sampled directions. To best utilize the modeling capacity of 
the decoders, we equip the decoders with two graphics priors. The first 
prior—transformation of directions into learned shading frames—facilitates 
accurate reconstruction of mesoscale effects. The second prior—a microfacet 
sampling distribution—allows the neural decoder to perform importance 
sampling efficiently. The resulting appearance model supports anisotropic 
sampling and level-of-detail rendering, and allows baking deeply layered 
material graphs into a compact unified neural representation. 

By exposing hardware accelerated tensor operations to ray tracing shaders, 
we show that it is possible to inline and execute the neural decoders effi- 
ciently inside a real-time path tracer. We analyze scalability with increasing 
number of neural materials and propose to improve performance using 
code optimized for coherent and divergent execution. Our neural material 
shaders can be over an order of magnitude faster than non-neural layered 
materials. This opens up the door for using film-quality visuals in real-time 
applications such as games and live previews. 

1 INTRODUCTION


Recent progress in rendering algorithms, light transport methods,
and ray tracing hardware has pushed the limits of image quality
that can be achieved in real time. However, progress in real-time 
material models has noticeably lagged behind. While deeply layered 
materials and sophisticated node graphs are commonplace in off- 
line rendering, such approaches are often far too costly to be used in 
real-time applications. Aside from computational cost, sophisticated 
materials pose additional challenges for importance sampling and 
filtering: highly detailed materials will alias severely under minifi- 
cation, and the complex multi-lobe reflectance of layered materials 
causes high variance if not sampled properly. 

Recent work in neural appearance modelling [Kuznetsov et al. 
2022; Sztrajman et al. 2021; Zheng et al. 2021] has shown that multi- 
layer perceptrons (MLPs) can be an effective tool for appearance 
modelling, importance sampling, and filtering. Nevertheless, these 
models do not support film-quality appearance; a scalable solu- 
tion that can handle high-fidelity visuals at real time has yet to be 
demonstrated. 

In this paper, we set our goal accordingly: to render film-quality
materials, such as those used in the VFX industry, in real time. These
materials prioritize realism and visual fidelity, relying on very
high-resolution textures. Layering of reflectance components, rather than
an uber-shader, is used to generate material appearance yielding
arbitrary BRDF combinations with hundreds of parameters. For
these reasons, porting to real-time application is challenging.

Fig. 2. We show rendered images of five reference materials created with a layering approach similar to [Jakob et al. 2019] that we approximate with neural
models for representing the BRDF and importance sampling. All objects are challenging for real-time renderers due to their complex reflection behavior and
high resolution textures (see Table 1). The corresponding node graphs are shown in the supplemental.

Table 1. Statistics of our reference materials.

                 #layers   #graph nodes   #textures   total pixels
TEAPOT ceramic      5           37            70       1174.41 MP
TEAPOT handle       2           41            11        152.305 MP
SLICER handle       5           20             3        201.327 MP
SLICER blade        3           54            16        324.272 MP
INKWELL             5           49             4        201.327 MP

In order to render film-quality appearance in real time we i) care- 
fully cherry-pick components from prior works, ii) introduce algo- 
rithmic innovations, and iii) develop a scalable solution for inlining 
neural networks in the innermost rendering loop, both for classical 
rasterization and path tracing. We choose to forgo editability in 
favor of performance, effectively “baking” the reference material 
into a neural texture interpreted by neural networks. Our model can 
thus be viewed as an optimized representation for fast rendering, 
which is baked (via optimization) after editing has taken place. 

Our main focus is on developing an initial system that fits our 
criteria, and it naturally comes with limitations which we deemed 
acceptable, but hope to address in future work. Much like prior 
work, our method is not mathematically constrained to conserve 
energy or ensure reciprocity. Certain special cases, such as BRDFs 
with delta components, cannot be perfectly reproduced. We do not 
currently support refraction, although the latter could be added later 
with changes to the model. 

Our model consists of an encoder and two decoders, with the neural 
(latent) texture in between. The encoder maps BRDF parameters 
to a latent space, thereby converting a set of traditional textures 
(per-layer albedo, normal map, etc.) into a single multi-channel 
latent texture. Using the encoder, instead of optimizing the texture 


directly, is key to support materials with high-resolution textures. 
The latent texture is decoded using two networks: an evaluation 
network that infers the BRDF value for a given pair of directions, 
and a sampling network that maps random numbers to sampled 
(outgoing) directions. 

Our main algorithmic contributions can be characterized as em- 
bedding fixed-function elements—graphics priors—in the two neural 
decoders. First, we insert a standard rotation operation between 
trainable components of the BRDF decoder to handle normal mapped 
surfaces. Second, we utilize a network-driven microfacet distribution 
for importance sampling. These priors are necessary to efficiently 
utilize the (limited) expressive power of small networks. 

On the system level, we present an efficient method for inlining 
fully fused neural networks in rendering code. To the best of our 
knowledge, this is the first complete and scalable system for running 
neural material shaders inside real-time shading languages. A key 
contribution is an execution model that utilizes tensor operations 
whenever possible and efficiently handles divergent code paths. This 
allows fast inferencing in any shader stage including ray tracing 
and fragment shaders, which is important for adoption in game 
engines and interactive applications. We demonstrate graceful cost 
scaling in scenes with many (different) neural materials, running 
inside a real-time path tracer. Our neural model has a fixed evalu- 
ation cost, independent of the material complexity, allowing us to 
render complex materials that are the norm in offline rendering. To 
demonstrate this, we authored several highly detailed assets with 
layered materials (Figure 2) that provide visual detail down to a 10 
cm viewing distance. We can reproduce the visual fidelity of such 
complex assets, with shading being up to 10x faster than the orig- 
inal, moderately optimized shading models, while also providing 
additional sampling and filtering facilities (Figure 1). 

Achieving the visual fidelity of such complex assets at real-time 
rates required innovations both in the neural model and at the 
system level, and our paper is the result of contributions to both. 
We believe the joint evolution of models and systems to be crucial 
to bringing neural shaders to real-time, and we built our system to 
serve as a solid foundation in this regard. 


2 RELATED WORK 


In this section, we review previous work related to neural material 
representation, filtering, and sampling, and refer to Pharr et al. 
[2016] for a detailed overview of classical material models. 


2.1 Neural appearance modeling 


We focus on representing existing materials neurally and rendering 
them in real time on classical geometry. We therefore do not utilize 
ray marched neural fields [Baatz et al. 2022; Mildenhall et al. 2020; 
Miller et al. 2022], although these could present a viable alternative 
in the future. Our goals generally align with prior work on neural 
BRDFs [Fan et al. 2022; Kuznetsov et al. 2019, 2021; Rainer et al. 
2020, 2019; Sztrajman et al. 2021; Zheng et al. 2021]. Common to 
these methods is a conditioning of a neural network on a pair of 
directions, and optionally a trained latent code. Latent codes are 
typically stored in a texture [Thies et al. 2019] and sampled using 
classical UV mapping to support spatially varying BRDFs. 
However, we differ from prior work on a number of key axes: 


Obtaining latent textures. Kuznetsov et al. [2019] in their NeuMIP 
work employ direct optimization, updating a randomly-initialized 
latent texture via back-propagation, a simple but costly solution for 
large textures with millions of texels. In contrast, Rainer et al. [2019] 
encode a set of BRDF measurements into latent codes. We pursue a 
hybrid approach: we first train an encoder and, partway through 
training, use it to create a hierarchical latent texture, which we then 
finetune through direct optimization. This approach combines the 
speed of the encoder-decoder architecture with the flexibility of 
direct optimization. 


Encodings and priors. Both Zheng et al. [2021] and Sztrajman et al. 
[2021] reparametrize input directions into a half-angle coordinate 
system [Rusinkiewicz 1998]. While this specific encoding did not 
provide much benefit in our case, we leverage the principle and 
incorporate a novel graphics prior—rotation to learned shading 
frames—to better handle normal-mapped, layered materials. 


Rendering novel BRDFs. Fan et al. [2022] are able to render novel 
BRDFs not part of the training set through layering of latents. How- 
ever, this requires large neural networks unsuitable for real-time. 
We focus on small networks that render only materials they were 
trained on and do not pursue generalization. We support layered 
materials by capturing the joint effect of all layers at once, dispens- 
ing with the explicit layering of the original material, and avoiding 
any layering of neural components. 


2.2 Neural material filtering 


Aliasing due to shading is commonly addressed with mipmapping, 
but requires special care for non-diffuse materials as their appear- 
ance can change significantly with linear filtering. Methods such 
as LEAN [Olano and Baker 2010], LEADR [Dupuy et al. 2013] and 
MIPNet [Gauthier et al. 2022] use statistical methods or neural down- 
sampling to more closely match the prefiltered ground truth. While 
these approaches tune the parameters of traditional BRDFs, we in- 
stead train neural models and hierarchical textures to represent the 
filtered appearance directly, similarly to Kuznetsov et al. [2021] and 
Bako et al. [2022], albeit with a different interpolation scheme (see 
Section 4.1). However, we still leverage LEAN [Olano and Baker
2010] as a graphics prior to filter the inputs of our encoder. 


2.3 Neural material importance sampling 


Prior work on the importance sampling of neural materials can be
classified as: i) utilizing an analytical proxy distribution, ii) leveraging
normalizing flows, and iii) warping samples with a network directly. 

We utilize the first approach, in which a network parameterizes 
a standard analytical distribution. In contrast to Sztrajman et al.
[2021] and Fan et al. [2022], who use the Blinn-Phong model or an
isotropic Gaussian mixed with a cosine distribution, we propose to
leverage a standard microfacet model with the Trowbridge-Reitz 
normal distribution function (NDF) [Trowbridge and Reitz 1975; 
Walter et al. 2007]. The microfacet model better handles anisotropy 
that is prevalent in (filtered) realistic materials. 

Normalizing flows for sampling [Dinh et al. 2017] were first uti- 
lized for neural BRDFs by Zheng et al. [2021]. With sufficiently 
large networks, normalizing flows can accurately match intricate 
distributions. We implemented a flow with piecewise quadratic 
warps [Miller et al. 2019], but we found it challenging to match the 
quality of our analytical proxy at comparable runtime performance. 

The third approach, using the network directly to warp samples, 
has been recently explored by Bai et al. [2022] who aid training 
of the network with 2D optimal transport. This method, dubbed 
importance baking, has the drawback that the learned density only 
approximately matches the true Jacobian determinant of their warp. 
This leads to potentially unbounded bias, and we exclude this option 
to maintain compatibility with physically based renderers. 


3 OVERVIEW 


Our goal is to reproduce the appearance of real materials that stems
from the interaction of light with matter. It can be described using
the spatially varying bidirectional reflectance distribution function
(SVBRDF) $f(x, \omega_i, \omega_o)$ that quantifies the amount of scattered
differential radiance $\mathrm{d}L_o(x, \omega_o)$ due to incident radiance $L_i(x, \omega_i)$:

$$f(x, \omega_i, \omega_o) = \frac{\mathrm{d}L_o(x, \omega_o)}{L_i(x, \omega_i) \cos\theta_i \, \mathrm{d}\omega_i}, \tag{1}$$

where $x$ is a surface point, and $\omega_i$, $\omega_o$ are incident and outgoing
directions, respectively. The SVBRDF can be integrated over the
upper hemisphere $\mathcal{H}^2$ to produce directional albedo $\alpha(x, \omega_o)$:

$$\alpha(x, \omega_o) = \int_{\mathcal{H}^2} f(x, \omega_i, \omega_o) \cos\theta_i \, \mathrm{d}\omega_i. \tag{2}$$


We aim to represent both of these quantities with our model, which 
is illustrated in Figure 3. 
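
As an illustration of Equation (2), the directional albedo of any BRDF can be estimated by Monte Carlo integration. The following Python sketch is not part of the presented system: it assumes a Lambertian stand-in BRDF (the hypothetical `lambertian_brdf`) and uniform hemisphere sampling, chosen only because the exact albedo is then known to equal the diffuse reflectance.

```python
import math
import random

def lambertian_brdf(rho):
    """Stand-in BRDF for illustration only: constant rho / pi."""
    return lambda wi, wo: rho / math.pi

def sample_uniform_hemisphere(rng):
    """Uniform direction on the upper hemisphere (z >= 0); pdf = 1 / (2 pi)."""
    z = rng.random()                       # cos(theta) is uniform in [0, 1)
    r = math.sqrt(max(0.0, 1.0 - z * z))
    phi = 2.0 * math.pi * rng.random()
    return (r * math.cos(phi), r * math.sin(phi), z)

def estimate_albedo(brdf, wo, num_samples, rng):
    """Monte Carlo estimate of Eq. (2): integral of f * cos(theta_i) d(omega_i)."""
    pdf = 1.0 / (2.0 * math.pi)
    total = 0.0
    for _ in range(num_samples):
        wi = sample_uniform_hemisphere(rng)
        total += brdf(wi, wo) * wi[2] / pdf    # wi[2] is cos(theta_i)
    return total / num_samples

rng = random.Random(0)
albedo = estimate_albedo(lambertian_brdf(0.8), (0.0, 0.0, 1.0), 200_000, rng)
```

With enough samples, `albedo` approaches the reflectance 0.8 passed to the stand-in BRDF. The same estimator with a single sample per query gives a noisy but unbiased target of the kind a network can be regressed against.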

We design our model to serve as an optimized representation 
of existing (reference) SVBRDFs. That is, given a target material 
$f(x, \omega_i, \omega_o)$, we provide a function $g \approx f$ that closely approximates
the reference material and can be evaluated in real time. To be useful, 
our system must satisfy a number of properties: 


Visual fidelity. Our main goal is to faithfully reproduce a broad
range of challenging materials, including multi-layer materials with
low-roughness dielectric coatings, conductors with glints, stains,
and anisotropy. We wish to go beyond fitting to spatially uniform
measured material datasets [Dupuy and Jakob 2018; Matusik et al.
2003], and want to explicitly address materials with high resolution
textures (4k and above) with detailed normal maps.

Fig. 3. We use our neural BRDFs in a renderer as follows: for each ray that hits a surface with a neural BRDF, we perform standard (u, v) and MIP level l
computation, and query the latent texture of the neural material. Then we input the latent code z(x) into one or two neural decoders, depending on the needs
of the rendering algorithm. The BRDF decoder (top box) first extracts two shading frames from z(x), transforms directions $\omega_i$ and $\omega_o$ into each of them, and
passes the transformed directions and z(x) to an MLP that predicts the BRDF value (and optionally the directional albedo). The importance sampler (bottom
box) extracts parameters of an analytical, two-lobe distribution, which is then sampled for an outgoing direction $\omega_o$, and/or evaluated for PDF $p(x, \omega_i, \omega_o)$.


Level of detail. Unfiltered high-resolution materials tend to alias 
under minification and properly filtered reflectance can change sig- 
nificantly within a pixel footprint. We seek a solution that supports 
filtered lookups of the material and thus enables all-scale rendering 
at low sample counts. 


Importance sampling. In addition to representing the BRDF, we 
need an effective importance sampling strategy to permit deploy- 
ment in Monte Carlo estimators, such as path tracing. This includes 
the traditionally challenging problem of importance sampling fil- 
tered versions of the material. 


Performance. Our neural representation is geared towards real- 
time applications, where material evaluation may only use a small 
fraction of the total frame time. We require compatibility with path 
tracing, where materials are evaluated at random locations over 
many bounces. This precludes large networks and models relying 
on convolutions. 


Practicality. While the optimization of our neural material hap- 
pens in an offline process, training times have to remain reasonable 
even for high material resolutions (4k and beyond) for the system 
to remain practical. Days of training time are not acceptable. 


In Sections 4 and 5, we describe our neural architecture and 
its training procedure, following with a comparative analysis of 
individual components in Section 6. Since real-time performance is 
one of our main goals, we dedicate Section 7 to the task of efficiently 
evaluating the neural model from inside ray tracing shaders. We 
conclude by demonstrating the quality and runtime performance 
on a number of challenging scenes in Section 8. 


4 NEURAL BRDF DECODER 


In this section, we describe the architecture of our appearance model 
illustrated in Figure 3. The model consists of two main components: 
a latent texture and two neural decoders. All these components are 


jointly optimized to represent a specific material or a set of materials; 
details of the optimization procedure (e.g., encoding of the latent 
texture) follow in the next section. 

The latent texture represents spatial variations of the material 
with a compact, eight-dimensional code denoted z. Given a query 
location x and the corresponding latent code z(x), the BRDF value 
is inferred by a neural decoder $g$ with trainable parameters $\theta$:

$$f(x, \omega_i, \omega_o) \approx g(z(x), T \cdot \omega_i, T \cdot \omega_o; \theta), \tag{3}$$


where T represents a transformation of incident and outgoing direc- 
tions to a number of learned shading frames. Next, we discuss the 
properties of the latent texture z and then describe the procedure of 
extracting T. 


4.1 Latent texture 


Similarly to prior works [Kuznetsov et al. 2021; Thies et al. 2019], we 
store latent codes in a UV-mapped, hierarchical texture, where each 
texel characterizes the appearance of the object at a given spatial 
location and scale. To maintain the fidelity of the original material, 
we set the resolution of the finest level to the texture resolution of 
the original material, and we leverage its UV-parametrization to 
preserve the original texel density. 

Highly detailed materials may cause severe aliasing under minifi- 
cation (Figure 4, left columns in (a) and (b)). By default, our neural de- 
coder would reproduce such aliasing. To avoid this, the hierarchical 
latent texture stores the latent codes in a texture pyramid [Kuznetsov 
et al. 2021; Thies et al. 2019]. Each level of the pyramid contains 
latent codes that characterize the original material filtered with a 
specific filter radius. The decoder is trained to infer the properly 
filtered BRDF value for all levels of the pyramid (Figure 4, middle 
columns in (a) and (b)). 

During rendering, we first determine the pixel footprint at the 
intersection point, and project it into UV space [Akenine-Möller
et al. 2021]. We then determine the appropriate level of the texture 
pyramid to sample based on the area of the footprint. 

Fig. 4. Highly detailed materials will alias significantly when rendered
without supersampling (left columns, unfiltered). Supersampling averages
high frequency glints and produces a filtered material, but at impractical
sample cost for real-time (right columns, ground truth at 512 SPP). Our
neural material can render filtered materials without aliasing at any distance,
without supersampling (middle columns, ours).

The level index may be fractional and lie between two levels of
the pyramid. We probabilistically select one of them using Russian
roulette, and fetch the latent code via bilinear interpolation within
the level. This introduces a small, but bounded amount of variance.
We found this to yield higher quality than the more commonly used
method of trilinearly interpolating the latent codes. This is likely
because the latter strategy induces the additional constraint that
the latent interpolation produce plausible BRDF values across levels,
even though they may store very different content.
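
The lookup described above can be sketched in a few lines of Python. This is an illustrative reading rather than the shader code of the system: it assumes that the coarser neighbouring level is picked with probability equal to the fractional part of the level index, that texel centres sit at half-integer coordinates, and that texture borders are clamped. The names `select_mip_level` and `bilinear_fetch` are hypothetical.

```python
import math
import random

def select_mip_level(level, num_levels, rng):
    """Stochastically round a fractional level to one of its two neighbours.

    The upper neighbour is chosen with probability equal to the fractional
    part, so the expected selected level equals the fractional level.
    """
    level = min(max(level, 0.0), num_levels - 1.0)
    lo = int(math.floor(level))
    frac = level - lo
    hi = min(lo + 1, num_levels - 1)
    return hi if rng.random() < frac else lo

def bilinear_fetch(texture, u, v):
    """Bilinearly interpolate latent codes within one pyramid level.

    `texture` is a list of rows; each texel is a list of latent channels.
    (u, v) lie in [0, 1]; texel centres are at (i + 0.5) / width.
    """
    h, w = len(texture), len(texture[0])
    x, y = u * w - 0.5, v * h - 0.5
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0

    def texel(ix, iy):
        # Clamp-to-edge addressing (an assumption of this sketch).
        return texture[min(max(iy, 0), h - 1)][min(max(ix, 0), w - 1)]

    c00, c10 = texel(x0, y0), texel(x0 + 1, y0)
    c01, c11 = texel(x0, y0 + 1), texel(x0 + 1, y0 + 1)
    return [(1 - fy) * ((1 - fx) * a + fx * b) + fy * ((1 - fx) * c + fx * d)
            for a, b, c, d in zip(c00, c10, c01, c11)]

rng = random.Random(1)
picks = [select_mip_level(2.25, 5, rng) for _ in range(20000)]
mean_level = sum(picks) / len(picks)           # averages out near 2.25
level0 = [[[0.0], [1.0]], [[2.0], [3.0]]]      # 2x2 texels, one latent channel
centre = bilinear_fetch(level0, 0.5, 0.5)      # midpoint between all four texels
```

Only one level is fetched per query, so the decoder never sees a blend of latent codes from two different levels; the price is the small amount of variance mentioned above.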


4.2 Transformation to learned shading frames 


Our focus on real-time applications severely constrains the size 
of the decoder network. This makes it all the more important to 
incorporate graphics priors into the architecture to handle realis- 
tic materials, such as those exemplified in Figure 2. These layered 
materials produce intricate SVBRDFs, where reflection lobes shift 
in direction as we move over the surface. Such effects are readily 
modeled in classical materials via textured transformations, e.g., 
using normal maps, but are hard to achieve for a standard MLP. 

A material may feature as many normal maps as scattering lay- 
ers. We aim to compress the stack of layers, but still provide the 
model with enough room to represent multiple normal maps. We 
therefore incorporate a transformation module into the network, 
which transforms incident and outgoing directions into a number 
of learned shading frames (mult operation in Figure 3). Specifically, 
we use a single trainable layer to extract a fixed number $N$ of normals
$(n_1 \dots n_N)$ and tangent vectors $(t_1 \dots t_N)$ from the latent code.
Then we construct a basis $(t_i, b_i, n_i)$ for each $i$-th pair of normalized
normals and tangents, and construct a combined transformation
matrix $T$:

$$T = \begin{bmatrix}
t_{1,x} & t_{1,y} & t_{1,z} & \cdots & t_{N,x} & t_{N,y} & t_{N,z} \\
b_{1,x} & b_{1,y} & b_{1,z} & \cdots & b_{N,x} & b_{N,y} & b_{N,z} \\
n_{1,x} & n_{1,y} & n_{1,z} & \cdots & n_{N,x} & n_{N,y} & n_{N,z}
\end{bmatrix}. \tag{4}$$

The transformation layer then computes the product $T \cdot \omega_i$ and
$T \cdot \omega_o$, resulting in $N$ new incident and outgoing vectors, one pair
for each of the learned shading frames. The vectors are then fed
to the decoder. The transformation allows the model to rotate the
input directions into multiple, spatially varying shading frames in 
a single operation, improving the representational power of the 
network. We analyze the benefits in Section 6. 
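
A minimal Python sketch of this transformation module follows. It is not the implementation used in the system: the layer weights are random placeholders, the helper names are hypothetical, and the basis construction (Gram-Schmidt on the tangent, bitangent from a cross product) is an assumption, since the text only states that a basis is built from each normalized normal and tangent pair. Each returned frame corresponds to one 3x3 block of Equation (4).

```python
import math
import random

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def extract_frames(latent, weights, bias, num_frames):
    """One linear layer maps the latent code to N raw normals and N raw tangents."""
    raw = [dot(row, latent) + b for row, b in zip(weights, bias)]   # 6 * N values
    frames = []
    for i in range(num_frames):
        n = normalize(raw[6 * i: 6 * i + 3])
        t_raw = raw[6 * i + 3: 6 * i + 6]
        # Assumption: Gram-Schmidt makes the tangent orthogonal to the normal.
        t = normalize([tc - dot(t_raw, n) * nc for tc, nc in zip(t_raw, n)])
        bitangent = cross(n, t)
        frames.append((t, bitangent, n))     # rows of one 3x3 block of T
    return frames

def transform(frames, w):
    """Express direction w in every learned frame: one 3-vector per frame."""
    return [[dot(t, w), dot(b, w), dot(n, w)] for (t, b, n) in frames]

rng = random.Random(0)
N = 2
latent = [rng.uniform(-1.0, 1.0) for _ in range(8)]              # 8-dim latent code
weights = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(6 * N)]
bias = [0.0] * (6 * N)
frames = extract_frames(latent, weights, bias, N)
wi_local = transform(frames, normalize([0.3, -0.2, 0.9]))        # N rotated copies
```

Because every frame is orthonormal, each transformed direction keeps unit length. The product between a quantity derived from the latent code and the input direction is exactly the kind of multiplicative interaction that a plain MLP has to spend capacity on approximating.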


Discussion. It may not be immediately obvious why a vanilla MLP 
struggles with rotating directions. This is because, even though 
MLPs are built from matrix operations, they can only perform mul- 
tiplicative transformations of the inputs with the (fixed) network 
weights. They cannot readily multiply the input dimensions with 
each other. In our case, a decoder with a vanilla MLP cannot easily 
multiply $\omega_i$, $\omega_o$ with the latent code, which stores spatial variations
of the material. The decoder is forced to approximate the multi- 
plicative transform using its trainable layers, depleting its modeling 
capacity. Our approach is conceptually similar to (self-)attention 
models that augment neural networks with multiplicative trans- 
forms between activations [Rebain et al. 2022; Vaswani et al. 2017]. 


4.3 Importance sampling 


Using neural materials in a Monte Carlo renderer also requires an 
importance sampling technique. This is especially crucial in our real- 
time setting where acceptable variance levels need to be achieved 
at extremely low sample rates. 

We focus on a subset of samplers suitable for representation by
a network: an invertible transform $W$ from random variates $u \in
[0, 1)$ into outgoing directions $\omega_o = W(u; x, \omega_i)$, and its associated
probability density function (PDF) $p(\omega_o; x, \omega_i)$. Low variance results
are achieved whenever the shape of $p$ closely matches $f$.

Optimizing an MLP to perform the sample transform W does 
not guarantee invertibility of W and tractable PDF evaluations. Im- 
portance sampling thus requires a different approach than BRDF 
evaluation. We draw inspiration from prior work and utilize a neu- 
ral network to drive an existing analytic proxy distribution that is 
invertible in closed form. Like Sztrajman et al. [2021] and Fan et al. 
[2022], we use a linear blend between a cosine-weighted hemispher- 
ical density and a specular reflection component, but we differ in 
the choice of the specular component. 

Instead of the isotropic models proposed earlier (e.g., Blinn-Phong 
model [Sztrajman et al. 2021] or a 2D Gaussian in projected half- 
vector space [Fan et al. 2022]) we use the more general, state-of-the- 
art microfacet model based on a Trowbridge-Reitz NDF [Trowbridge 
and Reitz 1975; Walter et al. 2007] including elliptical anisotropy and 
non-centered mean surface slopes [Dupuy 2015]. This is well-suited 
both to the strongly normal-mapped materials represented in our 
target materials, as well as filtered BRDFs that naturally produce 
anisotropic distributions; we demonstrate the advantage in Section 6 
and provide additional details of the sampler in Appendix A. 

We train an additional importance sampling decoder MLP that 
infers parameters of the analytic model from the same latent code 
as used for the BRDF evaluation. This is conceptually similar to 
Sztrajman et al. [2021], though we additionally feed $\omega_i$ into the
decoder to capture Fresnel-like effects where, e.g., the diffuse-specular
mixing weights vary as a function of the incident angle. 
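
To make the structure of such a proxy concrete, the Python sketch below blends a cosine-weighted lobe with a Trowbridge-Reitz specular lobe in a local frame whose normal is +z. It is deliberately simplified and is not the sampler of Appendix A: the specular lobe is isotropic and centered (no elliptical anisotropy or shifted mean slopes), the lobe is chosen with an extra random number, and `alpha` and `w_spec` are fixed constants standing in for values that a sampling decoder would infer.

```python
import math
import random

def sample_cosine(u1, u2):
    """Cosine-weighted direction on the upper hemisphere."""
    r, phi = math.sqrt(u1), 2.0 * math.pi * u2
    return (r * math.cos(phi), r * math.sin(phi), math.sqrt(max(0.0, 1.0 - u1)))

def ggx_ndf(cos_h, alpha):
    """Isotropic Trowbridge-Reitz (GGX) normal distribution function."""
    a2 = alpha * alpha
    d = cos_h * cos_h * (a2 - 1.0) + 1.0
    return a2 / (math.pi * d * d)

def sample_ggx_reflection(wi, alpha, u1, u2):
    """Sample a half vector from D(h) cos(theta_h) and reflect wi about it."""
    cos_h = math.sqrt((1.0 - u1) / (1.0 + (alpha * alpha - 1.0) * u1))
    sin_h, phi = math.sqrt(max(0.0, 1.0 - cos_h * cos_h)), 2.0 * math.pi * u2
    h = (sin_h * math.cos(phi), sin_h * math.sin(phi), cos_h)
    k = 2.0 * sum(a * b for a, b in zip(wi, h))
    return tuple(k * hc - wc for hc, wc in zip(h, wi))

def pdf_mixture(wi, wo, alpha, w_spec):
    """Density of the blended sampler with respect to the solid angle of wo."""
    if wo[2] <= 0.0:
        return 0.0
    p_diff = wo[2] / math.pi
    h = [a + b for a, b in zip(wi, wo)]
    norm = math.sqrt(sum(c * c for c in h))
    h = [c / norm for c in h]
    wi_dot_h = sum(a * b for a, b in zip(wi, h))
    p_spec = ggx_ndf(h[2], alpha) * h[2] / (4.0 * wi_dot_h)
    return w_spec * p_spec + (1.0 - w_spec) * p_diff

def sample_mixture(wi, alpha, w_spec, rng):
    """Pick a lobe with probability w_spec, then sample it. The result may
    point below the horizon (pdf 0); a renderer would discard such samples."""
    u1, u2 = rng.random(), rng.random()
    if rng.random() < w_spec:
        return sample_ggx_reflection(wi, alpha, u1, u2)
    return sample_cosine(u1, u2)

rng = random.Random(0)
wi = (0.0, 0.0, 1.0)
wo = sample_mixture(wi, alpha=0.5, w_spec=0.5, rng=rng)

# Integrate the PDF over the upper hemisphere with the midpoint rule in theta;
# for wi along the normal the density does not depend on phi.
steps, total = 2000, 0.0
for k in range(steps):
    theta = (k + 0.5) * (0.5 * math.pi) / steps
    d = (math.sin(theta), 0.0, math.cos(theta))
    total += pdf_mixture(wi, d, 0.5, 0.5) * math.sin(theta)
total *= 2.0 * math.pi * (0.5 * math.pi) / steps
```

For these settings the density integrates to about 0.9 rather than 1, because specular samples reflected below the horizon carry no density. Both the sample transform and the PDF are available in closed form, which is the property that motivates driving an analytic distribution instead of letting a network warp samples directly.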
Fig. 5. We optimize our model by uniformly sampling the UV domain of the reference material. We start by fetching surface parameters (e.g., albedo), encoding
them using an MLP to a latent code, and interpreting it as a BRDF value using the decoder (path marked with ①). Once the encoder is sufficiently trained, we
construct the latent texture ② by processing all texels, and then drop the encoder. We continue “finetuning” the latent texture by sampling the UV space and
MIP levels of the texture and optimizing the texels directly ③. We sample exponentially distributed filter footprints to optimize all levels of the latent texture,
and train the decoder with prefiltered versions of the input material.


5 TRAINING 


In this section, we discuss the training procedure for our decoder and 
latent texture (illustrated in Figure 5), as well as how our training 
data is generated. 

One major challenge in training highly detailed materials is the 
sheer number of parameters that need to be optimized. Although 
the number of network weights is small, the resolution of the latent 
texture matches the texture resolution of the source material and can 
be considerable: the ceramic body of the TEAPOT (Figure 2) is defined
using 14 4k × 4k textures totaling 235 million texels, or 2.5 billion
latent parameters. Optimizing these parameters independently using 
backpropagation is impractical. 
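
One way to arrive at figures of this size is sketched below. The breakdown is an interpretation rather than something the text states: it assumes that 4k means 4096 × 4096 texels, that each texel stores the eight latent channels of Section 4, and that a full MIP pyramid adds roughly one third on top of the finest level.

```latex
\begin{align*}
14 \times 4096^2 &= 234{,}881{,}024 \approx 235 \times 10^{6} \ \text{texels}, \\
234{,}881{,}024 \times 8 \times \tfrac{4}{3} &\approx 2.5 \times 10^{9} \ \text{latent parameters}.
\end{align*}
```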

Instead, we make use of an encoder in the first training phase to 
bootstrap latent codes, which we describe next. 


5.1 Encoder 


The encoder is a simple MLP that takes the parameters k(x) of 
the original material (albedo, roughness, normal maps, etc. for all 
material layers) at a given query location x as input, and outputs 
the corresponding latent vector z(x). To bootstrap the filtering, we 
prefilter the material parameters k(x) (using LEAN [Olano and 
Baker 2010]) for coarse MIP levels of the hierarchy. 

In the first training phase, the model is trained end-to-end by 
forwarding the latent code from the encoder directly to the decoder, 
bypassing the latent texture. 

After the decoder converges, we switch to the finetuning phase. 
The latent texture is initialized by evaluating the encoder for all 
texels, after which the encoder is dropped. The contents of the latent 
texture are then trained directly using backpropagation through the 
decoder. Because the encoder only participates in training, it has no 
impact on the evaluation cost during rendering. 

Beyond speeding up training, the encoder also improves the struc- 
ture of the latent space: it guarantees that similar material parame- 
ters are mapped to similar points in the latent space. This leads to 
better results under interpolation, and makes the job of the decoder 
easier. In contrast, direct optimization is prone to leaving portions of
the random initialization noise in the latent texture, as analyzed in 
Section 6.2. 

The encoder can be optimized to encode multiple materials, or 
even the full appearance space spanned by the reference BRDF (by 


sampling its parameters uniformly). Since our latent textures have a 
large memory footprint, in practice we train each one individually 
along with its own encoder, unless stated otherwise. 


5.2 Data generation and optimization 


We generate training data by uniformly sampling the UV space of 
the target (multi-layered) material. For each sample, we generate 
random directions $\omega_i$ and $\omega_o$ by uniformly sampling their half
and difference vectors [Rusinkiewicz 1998; Sztrajman et al. 2021], 
and evaluate the reference BRDF value. Each sample additionally 
contains: normal, tangent, albedo, roughness, and layer weight, 
exported for each of the layers. Depending on the layer count a 
single sample may require over a hundred floating point numbers. 
We generate the samples on the GPU online during training. 


Filtering. We discretely sample a pyramid level for each training 
sample from an exponential distribution, favoring finer levels. We 
average multiple sample points drawn from a Gaussian with appro- 
priate footprint for the level, and choose the number of samples 
proportional to the filter area. This sampling process is fast enough 
that it does not significantly impact training time. 
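
The filtered training targets can be sketched as follows. The concrete choices here are assumptions made for illustration and are not given in the text: the decay rate of the level distribution, the mapping from level to Gaussian width, and the cap on the sample count. `material` is a hypothetical callable returning one scalar per UV coordinate.

```python
import math
import random

def sample_pyramid_level(num_levels, rate, rng):
    """Draw a discrete level with probability proportional to exp(-rate * level),
    so that finer levels (small indices) are drawn more often."""
    weights = [math.exp(-rate * l) for l in range(num_levels)]
    return rng.choices(range(num_levels), weights=weights, k=1)[0]

def filtered_value(material, u, v, level, base_res, rng,
                   samples_per_texel=1, max_samples=256):
    """Average material(u, v) over a Gaussian footprint matching the level."""
    # Assumed footprint: standard deviation of half a texel of the chosen level.
    sigma = 0.5 * (2 ** level) / base_res
    # Sample count proportional to the filter area (4^level), capped for speed.
    count = min(max_samples, max(1, samples_per_texel * 4 ** level))
    total = 0.0
    for _ in range(count):
        total += material(u + rng.gauss(0.0, sigma), v + rng.gauss(0.0, sigma))
    return total / count

rng = random.Random(0)
levels = [sample_pyramid_level(5, 0.5, rng) for _ in range(10000)]
flat = filtered_value(lambda u, v: 0.5, 0.3, 0.7, level=3, base_res=4096, rng=rng)
```

In a full pipeline the averaged quantity would be the reference BRDF value for a fixed pair of directions rather than a scalar texture lookup.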


Mollification. Materials with very narrow peaks (e.g., the smooth
glaze of the TEAPOT) lead to large training errors early in training
and are challenging to learn for the network. To solve this, we
initially blur the material directionally by averaging multiple samples
from a small cone centered on $\omega_o$. The angle of the cone decreases
during training, so that the network initially learns broad features 
of the material before converging to the reference. 


Optimization. We train the BRDF decoder and the importance 
sampler simultaneously to establish a shared latent space. The BRDF 
prediction is optimized using the $L_1$ loss in log space [Zheng et al.
2021]. The PDF of inferred importance samples $\omega_o$, produced by
the sampling MLP, is scored using the KL divergence against the
current state of the learned BRDF (evaluated for the sampled $\omega_o$).
We found that training stability is improved when the KL loss does 
not impact the latent texture (only the sampling decoder). This way, 
the sampler learns how to interpret the latents without interfering 
with the main BRDF evaluation decoder. 
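
A loss computed in log space keeps very bright specular peaks from dominating the gradient. The sketch below assumes the common transform log(1 + x); the exact form used for training is not specified in the text, and `log_l1_loss` is a hypothetical name.

```python
import math

def log_l1_loss(predicted, reference):
    """Mean absolute error between log(1 + x) transformed BRDF values."""
    assert len(predicted) == len(reference)
    return sum(abs(math.log1p(p) - math.log1p(r))
               for p, r in zip(predicted, reference)) / len(predicted)

# A specular peak of 1000 against a prediction of 500 costs far less in log
# space than the raw difference of 500 would suggest.
peak_error = log_l1_loss([500.0], [1000.0])
diffuse_error = log_l1_loss([0.2], [0.3])
```

Under this transform the peak error above is roughly log 2, which is within an order of magnitude of the diffuse error instead of thousands of times larger.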

Albedo predictions, if enabled, are optimized using the L2 loss 
against one-sample MC estimates of Equation (2). 
Fig. 6. Top row: Optimized latent textures (3 channels shown as RGB) for
the neural INKWELL material at three levels of the MIP hierarchy. Bottom
row: The corresponding distribution of latent (left) and network parameter
magnitudes (right). All parameters lie comfortably within the $(2^{-14}, 2^{16})$
numerical range of FP16 normal numbers (excluding denorms), making
quantization easy. The other materials show very similar distributions.


We optimize our models using 300k iterations, processing two 
batches of 65k training samples in each iteration; one for optimizing 
the BRDF decoder and one for the sampler. This amounts to nearly 
40 billion (online-generated) material samples in total, with training 
times lasting around 4-5 hours per material on a single NVIDIA 
GeForce RTX 4090. Further details of the training procedure are 
provided in the supplemental document. 
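
The sample count quoted above follows directly from the iteration and batch figures. The short check below evaluates it for two readings of "65k"; whether the batch size is exactly 65,000 or 2^16 = 65,536 is not stated here, so both are shown as assumptions.

```python
def total_training_samples(iterations, batches_per_iteration, batch_size):
    """Total number of online-generated material samples seen in training."""
    return iterations * batches_per_iteration * batch_size

for batch_size in (65_000, 2 ** 16):  # "65k" read literally, or as 2**16
    total = total_training_samples(300_000, 2, batch_size)
    print(f"batch size {batch_size}: {total / 1e9:.2f} billion samples")
# batch size 65000: 39.00 billion samples
# batch size 65536: 39.32 billion samples
```

Either reading is consistent with "nearly 40 billion".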


Precision. We train master parameters for the BRDF decoder and
sampler in 32-bit floating-point (FP32) precision. It is possible to
make careful use of mixed precision training to further improve
training performance without losing accuracy, but due to the small
sizes of our MLPs we did not explore this option. For efficient inference,
we use post-training quantization to convert the parameters
to half precision (FP16) at load time. Figure 6 shows a representative
example of the distribution of parameters for the evaluation and
sampling models. In all our example configurations, the numerical
range of network parameters lies within the normal range of
FP16. In future work, we plan to explore quantization-aware training
to further reduce runtime precision to INT8 or lower.
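
A sketch of such a load-time conversion with a range check, using NumPy. The function name and the report format are illustrative assumptions; the constants are the standard FP16 limits.

```python
import numpy as np

FP16_MIN_NORMAL = 2.0 ** -14  # smallest positive normal FP16 value
FP16_MAX = 65504.0            # largest finite FP16 value

def quantize_to_fp16(params_fp32):
    """Cast FP32 master parameters to FP16 and count out-of-range values."""
    params_fp32 = np.asarray(params_fp32, dtype=np.float32)
    mag = np.abs(params_fp32)
    report = {
        # Magnitudes above the largest finite FP16 value.
        "overflow": int(np.count_nonzero(mag > FP16_MAX)),
        # Non-zero magnitudes that would become FP16 denormals (or zero).
        "subnormal": int(np.count_nonzero((mag > 0) & (mag < FP16_MIN_NORMAL))),
    }
    with np.errstate(over="ignore"):
        quantized = params_fp32.astype(np.float16)
    return quantized, report
```

A report with both counts at zero corresponds to the situation described for the trained models, where a plain cast is sufficient.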


6 MODEL ANALYSIS AND ABLATION 


Now that we have introduced our appearance model and its training
procedure, we will analyze the main technical novelties: i) the
transformation into learned shading frames, ii) the anisotropic
importance sampler, and iii) the use of the encoder. We also demonstrate
the filtering capabilities and the option of inferring albedo.
A number of neural appearance models have been published in 
the past, addressing various aspects of appearance modeling, e.g., 
geometric level of detail [Kuznetsov et al. 2021, 2022], interpretabil- 
ity of the latent space [Zheng et al. 2021], or layering of neural 
components [Fan et al. 2022]. These are complementary to our sys- 
tem and could be incorporated in the future. In this work, we focus 
on accommodating film-quality visuals and efficient execution on
modern GPUs (presented in Section 7). 

Due to the difference in focus, it is hard to compare our work to 
previous approaches directly. Instead, we compare to two ablated 
variants of our model in Figure 7 and relate them to corresponding 
components in prior work. 


Vanilla MLP decoder w/ latent texture. The most basic variant uti- 
lizes only a hierarchical latent texture and a vanilla MLP decoder. As 
such, there is no explicit rotation to shading frames in the decoder, 
and the texels of the texture are optimized directly via backpropaga- 
tion. This variant can be viewed as the decoder by Sztrajman et al. 
[2021] extended to handle spatial variations using a hierarchical 
neural texture [Thies et al. 2019]. The model and the training
procedure are also conceptually close to the NeuMIP model [Kuznetsov
et al. 2021], except that NeuMIP additionally features a UV-offsetting 
module for handling displaced surfaces. The results of this variant 
(Figure 7, first column) fail to correctly reproduce the spatial details 
of the reference material. 


Latent texture encoder. The second column in Figure 7 shows
the benefits of adding the encoder (Section 5.1). The texture detail
is reproduced more faithfully for two main reasons. First, the
encoder prevents situations where multiple texels with identical
BRDF end up with different latent codes after optimization. Such a
many-to-one (non-injective) mapping of latents to BRDF values often
occurs in the basic model (first column), depleting the modeling
capacity of the decoder. Second, the encoder amortizes each training
record over many latent texels instead of optimizing a single latent
texel. While the spatial variations are captured well, the decoder is
unable to capture the narrow reflection lobe of the TEAPOT ceramic. This
suggests that the model has insufficient modeling capacity to handle
both spatial variations and high-frequency reflections, which can be
fixed by increasing the size of the decoder. The encoder is inspired
by the work of Rainer et al. [2019] who use it for compressing BTFs.


Transformation to learned shading frames. In the third column 
of Figure 7, we prepend the MLP decoder with the transformation 
of directions to two learned shading frames, which are extracted 
from the latent code using an extra trainable layer with 12 neurons. 
This constitutes our complete model. As discussed in Section 4.2, 
performing a multiplicative operation on the inputs explicitly spares 
the MLP from approximating it using its non-linear layers. The qual- 
ity of the results improves significantly, including effects that are not 
necessarily related to normal mapping. This suggests that modeling 
capacity retained by performing the explicit shading frame trans- 
formation is “invested” in better capturing the shape and spatial 
variations of the BRDF. 
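
The explicit transformation can be sketched as follows. This is an illustration under assumptions: each of the two frames is taken to be parameterized by a raw normal and a raw tangent (6 of the 12 outputs each), orthonormalized with Gram-Schmidt, and the ordering of the 12 values is hypothetical.

```python
import numpy as np

def build_frame(normal, tangent):
    """Gram-Schmidt: return a 3x3 matrix whose rows are (t, b, n)."""
    n = normal / np.linalg.norm(normal)
    t = tangent - np.dot(tangent, n) * n
    t = t / np.linalg.norm(t)
    b = np.cross(n, t)
    return np.stack([t, b, n])

def transform_directions(frame_params, wi, wo):
    """Express wi and wo in two frames decoded from 12 raw values.

    Assumed layout: [normal_0, tangent_0, normal_1, tangent_1], 3 floats each.
    Returns 12 values (2 frames x 2 directions x 3 coordinates) for the MLP.
    """
    p = np.asarray(frame_params, dtype=np.float64).reshape(2, 2, 3)
    feats = []
    for normal, tangent in p:
        frame = build_frame(normal, tangent)
        feats += [frame @ wi, frame @ wo]
    return np.concatenate(feats)
```

The rotation is a product of learned quantities and input directions, which is exactly the kind of multiplicative operation a small MLP with additive layers and pointwise activations approximates poorly.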


Table 2 reports various statistical metrics averaged over all im- 
ages in Figure 7; metrics for individual images are provided in the 
supplemental document. 


6.1 Filtering 


We evaluate the quality of our filtering in Figure 8 by comparing
individual levels of the latent pyramid to ground truth rendered
with supersampling. Our filtered model is a good match up close,
but shows loss of small detail from a medium distance. This is
because latent optimization does not work as well for coarser levels
as it does for level 0 and slightly overblurs the result. This may
be compensated by biasing our level selection towards finer MIP
levels, at the cost of some aliasing. From afar, all levels have a similar
appearance.

[Figure 7 columns: Vanilla MLP decoder with latent texture (basic variant) | With latent texture encoder (improved training) | With transformed ω_i, ω_o, full model (improved training and decoding) | Reference]

Fig. 7. A qualitative comparison of two ablated variants and our full model. A vanilla MLP decoder with directly optimized latent texture (first column)
provides limited quality. Training an encoder to produce the latent texture (second column) ensures that texels with identical appearance feature identical
latent codes, easing the decoding to BRDF values. Augmenting the MLP decoder with an explicit transformation of directions to learned shading frames (our
full model, third column) further improves the reproduction of the reference image (last column). The bottom left corners show images of the FLIP difference
metric. The models without the shading frame extractor (first two columns) were equipped with an extra first layer with 8 neurons to roughly match the
number of parameters of the full model.

[Figure 8 columns: Reference | Footprint-based | Level 0 | Level 1 | Level 2 | ... | Level 5]

Fig. 8. We evaluate the quality of our filtering by comparing footprint-based level selection to fixed latent pyramid levels (rendered with supersampling) on
the CHEESE SLICER asset at different distances. Up close, coarser levels show loss of small detail such as glints, which is reflected in our filtered result. This is not
the case for level 0, which is a near perfect match to the ground truth (at the cost of aliasing). From afar, all levels average to visually similar appearance.

Table 2. Image error metrics averaged over the four images in Figure 7 for
each of the three compared variants. Material-specific statistics are included
in the supplemental material.

                        Vanilla MLP    w/ encoder    w/ frame transform
Mean FLIP               0.2390         0.1956        0.0815
Mean abs. error         0.0769         0.0652        0.0183
Mean sqr. error         0.0682         10.1933       0.0057
Mean rel. abs. error    0.2177         0.3439        0.0656
Mean rel. sqr. error    0.0798         265.4018      0.0090
SMAPE                   0.2670         0.2397        0.0713


6.2 Latent texture optimization 


We further analyze the benefits of using the encoder in Figure 9, in 
which we compare the latent textures of different configurations 
at MIP level 0. We visualize latent textures obtained via direct opti- 
mization (top row) and using the encoder at small (512 x 512, left) 
and large (4k x 4k, right) resolutions. The bottom insets show a 
close-up of the learned texture and the rendered appearance of this 
area. While direct optimization and the encoder perform compara- 
bly at small resolutions (as used for instance in NeuMIP [Kuznetsov 
et al. 2021]), the difference becomes apparent at high resolutions. At 
resolution 4k x 4k, the directly optimized texels receive roughly 64x 
fewer gradient updates than texels of the 512 x 512 latent texture. 
This results in the decoder having to map vastly different latent 
codes (due to random initialization) to the same BRDF value, hinder- 
ing its performance. Much of the initialization noise is still visible 
in the converged model. On the other hand, the encoder provides a 
more data- and compute-efficient approach, yielding high-fidelity 
visuals. All models were trained using the same amount of training 
data. Despite being computationally less intense during training, 
the models with direct optimization nearly doubled the training 
times (up to 10 hours) due to their significantly higher memory 
requirements. 
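
The factor of 64 follows from the texel counts, assuming "4k" denotes 4096 texels per side and that a fixed budget of S training samples is spread uniformly over the texture:

```latex
\frac{S / (4096 \times 4096)}{S / (512 \times 512)}
  = \left(\frac{512}{4096}\right)^{2}
  = \frac{1}{8^{2}}
  = \frac{1}{64}
```

Each directly optimized texel of the large texture therefore sees one sixty-fourth of the gradient updates, whereas the encoder weights are shared and updated by every sample regardless of resolution.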


6.3 Importance sampling


We compare the importance sampler described in Section 4.3 against 
a simplified variant resembling that from Sztrajman et al. [2021] 
and Fan et al. [2022]. This variant is trained to only produce two 
outputs: an isotropic roughness parameter and a relative weight 
for mixing the specular and diffuse components. Figure 10 shows 
the benefit of the more general approach in the context of level-of- 
detail rendering, where it is useful to sample both non-centered and 
anisotropic NDFs for normal mapped and filtered BRDFs. 

We also considered using samplers based on normalizing flows
[Dinh et al. 2017] in our system, in particular the variant described
by Zheng et al. [2021], where the distribution of half-vectors is represented
by two piecewise quadratic warps [Müller et al. 2019], each
parameterized by an MLP (3 layers w/ 16 neurons). We found this
to yield comparable sampling quality to our chosen approach, but
it increases the total frame render time by a factor of 2-3.8x (see
Figure 11), making it less viable in our real-time context. This is
explained by the additional overhead of the warps and the need
to evaluate a larger number of MLPs at shading time. Normalizing
flows generally run 4 MLPs at each hit: 2 when sampling an
outgoing direction and 2 when evaluating the associated PDF, e.g.
for computing multiple importance sampling (MIS) weights [Veach
and Guibas 1995]. In contrast, our method only needs to query the
sampling network once per hit and caches the resulting analytic
proxy parameters for the subsequent sampling and PDF evaluation
steps.

[Figure 9 panels: latent textures at 512 x 512 (left) and 4k x 4k (right), each with a latent close-up and a rendered inset.]

Fig. 9. Latent textures of the INKWELL asset optimized directly (top row)
and using an encoder (bottom row). Direct optimization works well only for
small textures (top left) but it struggles with high resolutions (top right) as
independently optimizing individual texels is computationally inefficient;
the latent texture still contains a large amount of initialization noise after
many iterations. Therefore, we train an encoder (bottom row) that transforms
PBR surface attributes into latent codes, and can be executed at
any resolution. All analyzed configurations were optimized using the same
amount of data. The left inset zooms in on a small part of the texture that
is partly visible in the rendered inset on the right.


6.4 Albedo inference 


Figure 12 demonstrates the ability of a data-driven BRDF model to 
learn additional material characteristics. The BRDF decoder outputs 
an extra RGB triplet approximating the albedo of the multilayer 
material. We optimize the triplet against (one-sample) estimates of 
the true albedo during training using the L2 loss, which ensures
convergence towards the mean. The ability to predict albedo gives our
approach an edge over complex materials composed of analytical 
models, that can only output texture values of individual compo- 
nents, since numerical albedo estimation is typically infeasible in a 
path tracer. The albedo value can be used, e.g., to guide a denoiser. 
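
The claim that the L2 loss converges towards the mean can be made explicit. The notation is an assumption introduced for this note: a is the predicted albedo, \hat{A} an unbiased one-sample Monte Carlo estimate, and A the true albedo with E[\hat{A}] = A.

```latex
\frac{\partial}{\partial a}\, \mathbb{E}\!\left[(a - \hat{A})^{2}\right]
  = 2\left(a - \mathbb{E}[\hat{A}]\right) = 0
\quad\Longrightarrow\quad
a^{\star} = \mathbb{E}[\hat{A}] = A
```

The minimizer of the expected squared error is the expectation of the noisy target, so training against single-sample estimates still recovers the true albedo. An L1 loss would instead converge to the median of the estimates, which is generally biased for skewed estimators.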

Fig. 10. The importance sampler (top row) reduces noise levels compared to a simpler variant only supporting isotropic specular reflections (bottom row),
in the spirit of Sztrajman et al. [2021] and Fan et al. [2022]. Left: Fine details of a normal map are captured using a non-centered microfacet NDF. Right:
At coarser MIP levels, the filtered distribution is strongly anisotropic. The zoomed views are rendered using 4 SPP. False-color images show the pixel-wise
standard deviation and its mean across the entire inset.

Fig. 11. Pixel-wise standard deviation images of our importance sampler against an alternative implementation based on normalizing flows. The sampler
architecture in the first column (using warps with 8 bins, matching that of Zheng et al. [2021]) is adequate for the glossy INKWELL metal, but it struggles with
the highly specular peak of the TEAPOT ceramic. The second column (using a higher-quality warp with 16 bins) captures the peak and roughly matches the
variance of our sampler based on the analytic proxy (third column). The last column shows corresponding (log scale) polar plots of the learned densities. The
overlaid numbers report rendering time (for the full frame at 1 SPP) and the time to unit variance (TTUV), i.e. the product of mean variance and render time.
This reveals a significant runtime overhead of normalizing flows. The size of the evaluation network is fixed at 2 layers w/ 32 neurons in all cases.


7 INLINE NEURAL MATERIALS 


In this section, we describe the runtime system for inlining our neural
appearance model in ray tracing shaders. Similar to recent work
on real-time NeRFs [Müller et al. 2022], we implement fully fused
neural networks from scratch on the GPU. Instead of hand-written
kernels however, we use run-time code generation to evaluate the
neural model inline with rendering code. This allows fine-grained
execution of neural networks at every hit point in a ray tracing
shader program, intermixed with hand-written code. There are several
technical challenges in making this possible.

First, existing machine learning frameworks are built for coherent
execution of neural networks in large batches. Tools for integrating
neural networks in real-time shading languages such as
GLSL or HLSL, with potentially divergent execution, are largely
non-existent. Second, we want to leverage hardware accelerated
matrix multiply-accumulate (MMA) operations in recent GPU architectures
by AMD,¹ Intel,² and NVIDIA,³ but these instructions are
not exposed in current shading languages. Last, the execution and
data divergence in a renderer are challenging for neural networks,
which load large amounts of parameter data from memory.

In the following, we discuss how we address each of these challenges
in order to reach real-time performance.


7.1 Neural material shaders 


Our neural model consists of several small MLPs, interconnected
by blocks of non-neural operations. We train materials offline and
export a description of the final model along with its learned hierarchical
latent textures, stored as mipmapped 16-bit RGBA images.
Texture compression of the latents is an interesting avenue for future
work. In particular, neural texture compression [Vaidyanathan
et al. 2023] may be very fruitful as the compression and neural
material model could be trained end-to-end.

¹ https://gpuopen.com/learn/wmma_on_rdna3
² https://www.intel.com/content/www/us/en/developer/articles/technical/introduction-to-the-xe-hpg-architecture.html
³ https://developer.nvidia.com/tensor-cores

Fig. 12. The BRDF decoder can be trained to additionally infer the albedo
of the material by optimizing its additional RGB output against a Monte
Carlo estimate of the albedo of the reference material.

The runtime system compiles the neural material description into 
optimized shader code. We target the open source Slang shading 
language [He et al. 2018], which has backends for a variety of targets 
including Vulkan, Direct3D 12, and CUDA. Slang supports shader 
modules and interfaces for logically modularizing code. We generate 
one shader module per neural material, implementing the same 
interface as hand-written materials. In other words, neural materials 
are executed by the renderer no differently than classical ones. 


Code Generation. GPUs use a single instruction, multiple threads 
(SIMT) execution model, where batches (wavefronts or warps) of 
threads execute in lockstep, with each thread operating on its own 
registers. In a shader, threads may be terminated or masked out 
due to control flow. Because each thread may process a different hit 
point and material, there is no guarantee that all threads in a warp 
evaluate the same network. 

We handle this by generating two code paths, optimized for di- 
vergent and coherent execution respectively. The shader selects 
dynamically per warp which path to take. In the divergent case, 
we rely on the hardware SIMT model to handle divergence and 
generate an unrolled sequence of arithmetic and load instructions. 
A majority of the instructions evaluate the large matrix multiplies 
in the MLP feedforward layers. We use fused multiply-add (FMA) 
instructions to operate on two packed 16-bit weights at a time. The 
weights are laid out in memory in order of access, and special care 
is taken to generate 128-bit vectorized loads. 
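
The per-warp path selection can be illustrated with a small CPU-side simulation. This is a sketch under assumptions: a warp of 32 lanes, and "coherent" meaning that all lanes are active and shade the same material. In an actual shader this decision would be made with wave-level vote intrinsics rather than Python sets, and the function name is hypothetical.

```python
from typing import Sequence

WARP_SIZE = 32  # NVIDIA warp width; other architectures may differ

def select_code_path(material_ids: Sequence[int], active: Sequence[bool]) -> str:
    """Pick the code path for one warp.

    Returns "coherent" only if every lane is active and all lanes evaluate
    the same material; otherwise the SIMT-friendly divergent path is used.
    """
    if len(material_ids) != WARP_SIZE or len(active) != WARP_SIZE:
        raise ValueError("expected one entry per lane")
    if all(active) and len(set(material_ids)) == 1:
        return "coherent"
    return "divergent"
```

For example, `select_code_path([7] * 32, [True] * 32)` selects the coherent path, while a single masked-out lane or a single differing material ID sends the whole warp down the divergent path.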


7.2 Tensor core acceleration 


Some recent GPU architectures offer hardware units for acceler- 
ating general matrix multiplication. While implementation details 
vary, core functionality is similar. We focus on NVIDIA’s tensor 
cores which provide many flavors of matrix multiply instructions, 
although the same idea applies to other architectures. 

These instructions are currently limited to compute APIs and
are not exposed in shaders. To address this, we modified an open
source LLVM-based DirectX shader compiler⁴ to add custom intrinsics
for low-level access. This mechanism allows us to generate
Slang shader code evaluating neural networks very efficiently using
tensor cores, which operate on 16 x 16 blocks of the weight matrix
simultaneously.

⁴ https://github.com/microsoft/DirectXShaderCompiler

Fig. 13. This partially open CAKE BOX is filled with 25 different neural
materials. The statistics show that our megakernel path tracer achieves a
high degree of shading coherency using shader execution reordering (SER)
over all vertices along long light paths.

MMA instructions require cooperation across the warp, which 
limits this fast path to coherent warps where all threads evaluate 
the same material. Additionally, loading network parameters also 
benefits from coherent access, requiring careful consideration of 
how to construct coherent warps, which we discuss next. 
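
To make the blocked evaluation concrete, the following NumPy sketch computes one MLP layer for a whole warp as a sequence of 16 x 16 tile products. It emulates the data flow on the CPU only; FP16 inputs with FP32 accumulation is an assumed configuration, and the function name is hypothetical.

```python
import numpy as np

TILE = 16  # tile edge length of one matrix multiply-accumulate step

def tiled_matmul_fp16(weights, activations):
    """Compute weights @ activations tile by tile with FP32 accumulation.

    `weights` is (out_features, in_features); `activations` holds one column
    per thread of the warp. All dimensions must be multiples of TILE.
    """
    m, k = weights.shape
    k2, n = activations.shape
    if k != k2 or m % TILE or k % TILE or n % TILE:
        raise ValueError("dimensions must match and be multiples of TILE")
    w = weights.astype(np.float16)
    x = activations.astype(np.float16)
    out = np.zeros((m, n), dtype=np.float32)
    for i in range(0, m, TILE):
        for j in range(0, n, TILE):
            acc = np.zeros((TILE, TILE), dtype=np.float32)
            for p in range(0, k, TILE):
                # One MMA step: a 16x16 weight tile times a 16x16 input tile.
                acc += (w[i:i + TILE, p:p + TILE].astype(np.float32)
                        @ x[p:p + TILE, j:j + TILE].astype(np.float32))
            out[i:i + TILE, j:j + TILE] = acc
    return out
```

Because every thread's activations form one column of the same matrix product, the scheme only works when all threads share the same weights, which is why the fast path is restricted to coherent warps.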


7.3 Shading coherency


Neural materials allow us to reproduce a variety of materials using 
the same shader code, simply by swapping out network weights 
and latent textures. This improves warp utilization (and thus per- 
formance) even for workloads with traditionally high execution 
divergence, such as path tracing. 

However, the increase in data divergence puts pressure on the 
memory system, and we can extract additional performance by 
increasing shading coherence. Classical coherent approaches like 
wavefront path tracing [Laine et al. 2013; van Antwerpen 2011] store 
hits to memory and globally reorder them after each bounce, but 
the high bandwidth requirements fundamentally limit their performance.
Recent hardware features such as Intel’s thread sorting unit
(TSU)⁵ and NVIDIA’s shader execution reordering (SER)⁶ instead
reorder work locally. We use a megakernel path tracer to keep paths 
on-chip, and benefit from the increased data coherence provided by 
SER. Figure 13 shows that the majority of warps are fully coherent 
(shading the same material with all threads active) with our path 
tracing architecture. 


7.4 Integration in a real-time path tracer


To study quality and performance, we implement our system for 
neural materials in a real-time path tracer [Clarberg et al. 2022a,b] 
built on the Falcor rendering framework [Kallweit et al. 2022]. The 
path tracer uses next-event estimation with MIS [Veach and Guibas 
1995], and each path calls the eval, sample, and evalPdf material 
interface multiple times. 
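
As an illustration of how the PDF returned by the material interface enters next-event estimation, here is a sketch of an MIS-weighted light sample using the balance heuristic of Veach and Guibas. Which heuristic the renderer actually uses is not stated here, so the balance heuristic is an assumption, and the function and parameter names are hypothetical.

```python
import math

def balance_heuristic(pdf_a: float, pdf_b: float) -> float:
    """MIS weight for a sample drawn from strategy A (balance heuristic)."""
    total = pdf_a + pdf_b
    return pdf_a / total if total > 0.0 else 0.0

def nee_contribution(f: float, cos_theta: float, radiance: float,
                     pdf_light: float, pdf_bsdf: float) -> float:
    """Contribution of one light sample in next-event estimation.

    `f` is the BRDF value from the material's eval call and `pdf_bsdf` the
    density its evalPdf call reports for the same direction.
    """
    if pdf_light <= 0.0:
        return 0.0
    weight = balance_heuristic(pdf_light, pdf_bsdf)
    return f * cos_theta * radiance * weight / pdf_light
```

The weights of the two strategies sum to one for any direction, so a neural material only needs to provide a PDF consistent with its own sampling routine to take part in MIS.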


Material complexity. In order to study rich materials, we added
support for physically-based, layered material graphs expressed in
the open standard MaterialX [Smythe and Stone 2021], a common
interchange format for high-fidelity materials in VFX and movie
production. This allows authoring complex layered materials (cf.
Figure 2) in Houdini and other tools. All materials consist of multiple
BRDFs combined through mixing or coating operations. Nearly all
parameters are textured, with resolutions of 4k-8k per texture. Some
materials stitch multiple (up to 14) 4k texture tiles for even higher
resolution. We compile material graphs into Slang shader modules
similar to how neural materials are handled.

⁵ https://www.intel.com/content/www/us/en/developer/articles/guide/real-time-ray-tracing-in-games.html
⁶ https://developer.nvidia.com/sites/default/files/akamai/gameworks/ser-whitepaper.pdf

Fig. 14. The INKWELL scene where the metal uses the proposed neural BRDF. The remaining parts use analytical BRDFs. The first three columns show different
sizes of the BRDF decoder, from fastest to the most accurate. In the corners we show a FLIP error image and the rendering performance of an image with a
single path sample per pixel (1 SPP) at 1920 x 1080 resolution using paths of up to length six. All images are rendered at 8192 SPP to suppress path tracing noise.

Fig. 15. The STAGE scene with four materials that we approximate using the proposed neural BRDFs. We use a similar layout as in Figure 14. FLIP error
images are in the corners, timings quantify the cost of rendering a 1 SPP image of the scene at 1920 x 1080 resolution using paths of up to length six. All images
are rendered at 8192 SPP to suppress path tracing noise. The rendering with neural BRDFs is 1.64x to 4.14x faster than the reference materials in full frame
time (averaged over the views in Figure 14 and here).

Table 3. Image error metrics averaged over all 7 views from Figures 14 and
15. View-specific statistics are included in the supplemental material.

                        2 x 16    2 x 32    3 x 64
Mean FLIP               0.1087    0.0551    0.0444
Mean abs. error         0.0439    0.0145    0.0121
Mean sqr. error         1.3855    0.0107    0.0101
Mean rel. abs. error    0.1042    0.0429    0.0347
Mean rel. sqr. error    0.0353    0.0056    0.0035
SMAPE                   0.1449    0.0468    0.0363


8 RUNTIME ANALYSIS AND RESULTS 


Our system is running on Direct3D 12 using hardware-accelerated 
ray tracing through DirectX Raytracing (DXR). All results are gener- 
ated on an NVIDIA GeForce RTX 4090 GPU at resolution 1920 x 1080, 
unless otherwise noted. We focus on evaluating quality and perfor- 
mance for path tracing with neural materials, and therefore disable 
denoising and other features that can bias the results. 

Performance is reported as total time in milliseconds (ms) for 
rendering a 1920 x 1080 image with one path sample per pixel (SPP). 
The timing in ms/SPP is representative for real-time path tracing, 
and can be scaled linearly to predict rendering time at higher SPP for 
applications such as high-quality preview rendering. Path length is 
capped at six path vertices (camera and light included) and Russian 
roulette is turned off for the purpose of these measurements.

We use reference materials authored in Houdini, exported into 
the USD format, and programmatically converted into optimized
Slang code that implements the shading graph as a weighted (ω_i-dependent)
combination of standard BRDF models. Each material
comprises multiple layers, where each layer is driven by a number 
of textures; the statistics are provided in Table 1. 


8.1 Visual accuracy 


In Figures 14 and 15 we compare the visual quality and rendering 
performance of three configurations of the neural BRDF decoder 
(the importance sampler always comprises 3 hidden layers with 
32 neurons each). As expected, quality varies with the size of the 
decoder. The largest configuration, with 3 hidden layers and 64 
neurons, reproduces the reference material well, with most details 
and colors captured accurately. The errors appear mostly at grazing 
angles of near-specular materials, e.g., the ceramic TEAPOT body
near the silhouette. We tested a number of hyper-parameter
configurations, and while some successfully reduced the grazing
angle artifacts (e.g., using L2 loss), the quality elsewhere degraded,
sometimes significantly. In order to escape this “zero-sum” game, 
we posit that another graphics prior is needed for handling Fresnel 
effects; we leave this to future work. 

Table 4. Full frame performance in ms/SPP with three different BRDF
decoder architectures (importance sampler is always 3 x 32). Column labels
denote the number and width of hidden layers. Numbers in parentheses
show speed up over the reference material, reported in the last column.

                   2 x 16          2 x 32          3 x 64           Ref.
INKWELL, View 1    3.64 (4.01x)    4.36 (3.34x)    9.94 (1.47x)     14.58
INKWELL, View 2    3.26 (4.71x)    4.16 (3.69x)    10.93 (1.41x)    15.36
STAGE, View 1      3.15 (4.21x)    3.71 (3.57x)    6.31 (2.10x)     13.25
STAGE, View 2      3.30 (4.33x)    4.32 (3.31x)    7.67 (1.86x)     14.29
STAGE, View 3      4.29 (4.66x)    5.73 (3.49x)    11.02 (1.81x)    19.98
STAGE, View 4      3.49 (4.74x)    4.39 (3.77x)    8.68 (1.90x)     16.53
STAGE, View 5      3.45 (2.26x)    4.12 (1.89x)    7.68 (1.01x)     7.78
Average            3.51 (4.14x)    4.40 (3.31x)    8.89 (1.64x)     14.54


We include FLIP [Andersson et al. 2020] false-color error images in 
corners to illustrate the perceived difference when toggling between 
the neural and reference BRDF renders; all images are also provided
as part of the supplemental material to facilitate such inspection. 
Table 3 lists average errors using a variety of standard image error 
metrics. The supplemental also includes polar plots for the learned 
materials with different decoder sizes. 


8.2 Runtime performance 


The smallest network yields the best rendering performance, al- 
beit at reduced reconstruction accuracy. Table 4 lists the absolute 
performance in ms/SPP and the relative speed improvement over 
rendering a GPU-optimized implementation of the reference material
(all running on an NVIDIA GeForce RTX 4090 GPU). The full
frame rendering times with the neural BRDFs are 1.64x (3 x 64) to
4.14x (2 x 16) faster than the reference material on average.
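
The average speedups can be reproduced from the per-view timings of Table 4. The check below assumes that the average speedup is the ratio of mean frame times (rather than the mean of per-view ratios); that reading reproduces the reported figures.

```python
# Full-frame timings in ms/SPP, copied from Table 4 (one entry per view).
TIMINGS_MS = {
    "2 x 16": [3.64, 3.26, 3.15, 3.30, 4.29, 3.49, 3.45],
    "2 x 32": [4.36, 4.16, 3.71, 4.32, 5.73, 4.39, 4.12],
    "3 x 64": [9.94, 10.93, 6.31, 7.67, 11.02, 8.68, 7.68],
}
REFERENCE_MS = [14.58, 15.36, 13.25, 14.29, 19.98, 16.53, 7.78]

def mean(values):
    return sum(values) / len(values)

def average_speedup(config):
    """Ratio of mean reference time to mean neural time for one decoder size."""
    return mean(REFERENCE_MS) / mean(TIMINGS_MS[config])

for config in TIMINGS_MS:
    print(f"{config}: {mean(TIMINGS_MS[config]):.2f} ms, {average_speedup(config):.2f}x")
# 2 x 16: 3.51 ms, 4.14x
# 2 x 32: 4.40 ms, 3.31x
# 3 x 64: 8.89 ms, 1.64x
```

Averaging the per-view speedups instead gives a slightly different value (about 4.13x for the 2 x 16 decoder), so the ratio of means appears to be the convention used.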

The frame time includes both general path tracing operations 
(light sampling, ray tracing, and control logic) as well as material 
sampling and evaluation. To estimate how much time is spent in 
material shading, and thus the relative speedups of our neural materials
over the reference materials, we set up a dedicated benchmark.
Since all neural material shaders in our system are running inline in 
the renderer, not as separate kernels, this has to be done with care; 
we lock the path distribution to a simple cosine-weighted distribu- 
tion, while ensuring that the compiler does not eliminate any of the 
material code. As a baseline, we measure the pure path tracing cost 
using a material with constant color. 

Figure 16 and Table 5 summarize our findings for two representative
views of the INKWELL scene (Figure 14, view 1 & 2) and STAGE
scene (Figure 15, view 3 & 4). The shading times with the neural 
BRDFs are 2.30x (3 x 64) to 9.06x (2 x 32) faster than the reference 
materials on average, with over an order of magnitude speedup for 
several views and the mid-sized BRDF decoder (2 x 32). 

Overall, the performance and visual fidelity scale in a predictable 
manner as neural BRDFs accommodate trading quality for perfor- 
mance. Next, we analyze the scaling behavior in more detail. 


8.3 Scalability 


Figure 17 shows that performance scales favorably when increasing
the number of neural materials. For this test we render the CAKE BOX
scene (Figure 13) and vary the number of (different) neural materials,
while keeping geometry and path distribution identical. Paths up to
ten vertices in length are traced and the scene also contains a small
number of traditional materials, in order to introduce significant
execution and data divergence.

[Figure 16 panels: STAGE timings (ms) and INKWELL timings (ms); bars for 2 x 32, 3 x 64, and Reference, split into material shading and path tracing.]

Fig. 16. Average path tracing and material shading time in ms, respectively,
for rendering a 1 SPP image of the scene at 1920 x 1080 pixels resolution
using paths up to six path vertices in length. Two different BRDF decoder
architectures are profiled, and compared to the cost of shading using the
reference materials.

Table 5. Material shading performance in ms/SPP with two different BRDF
decoder architectures (importance sampler is always 3 x 32). Column labels
denote the number and width of hidden layers. Numbers in parentheses
show speed up over the reference material, reported in the last column.

                   2 x 32           3 x 64          Ref.
STAGE, View 3      1.59 (10.19x)    6.02 (2.69x)    16.21
STAGE, View 4      1.23 (12.82x)    5.06 (3.12x)    15.77
INKWELL, View 1    1.59 (6.99x)     6.01 (1.85x)    11.11
INKWELL, View 2    1.74 (7.25x)     7.15 (1.76x)    12.61
Average            1.54 (9.06x)     6.06 (2.30x)    13.93

For very small numbers of neural materials, the network param- 
eters fit in caches close to the shader cores, whereas with more 
materials the parameters are increasingly streamed in from L2 or 
global memory. Our approach based on a megakernel path tracer 
with local work reordering manages to extract enough coherency 
to amortize the cost of memory loads well. 


Discussion. It is difficult to do a direct comparison to previous 
work as our focus is different; we show that neural materials can run 
efficiently in real-time shaders even in divergent workloads such as 
path tracing. There are few examples of inferencing in traditional 
shaders. One exception is deep shading [Nalbach et al. 2017] that runs 
a forward pass in GLSL for traditional deferred shading. Research 
on neural appearance models has generally used CUDA kernels,
either directly or via machine learning frameworks. 

Fan et al. [2022] record all intersections to global memory and
shade in a deferred manner, precluding adaptiveness and paying
the cost of memory transfers. The authors report a single BRDF
evaluation per pixel with resolution 1920 x 1080 costing 5 ms on an
NVIDIA RTX 2080Ti. NeuMIP [Kuznetsov et al. 2021] implements
an interactive CUDA/OptiX-based path tracer and reports similar
performance of 5 ms per evaluation at the same resolution/GPU.
The paper is sparse on details; in personal communication it was
stated that the reported 60 frames per second path tracing applies
to relatively short paths in a simple scene with a single material.
Scaling to multiple materials is not explored.

[Figure 17 plot: rendering time (ms) for increasing number of neural materials.]

Fig. 17. Rendering times for path tracing a 1 SPP image of the CAKE BOX
scene with varying numbers of neural materials. The measurements show
that our method is insensitive to the divergence introduced by path tracing
scenes with many neural materials; rendering times stay near constant
as material count increases. Two different BRDF decoder architectures
are studied. The path distribution is kept fixed to isolate the effects on
performance from scaling the number of materials.

We believe the scalability, handling of divergent shaders, and inte- 
gration in real-time shading languages are important contributions 
of our work for ease of adoption of neural materials more widely. 


9 LIMITATIONS & FUTURE WORK 


Energy conservation and reciprocity. Because the neural material 
is only an approximate fit of the input material, it is not guaranteed 
to be energy conserving. Although we have not observed this to be 
a problem in our tests, this could become an issue for high albedo 
materials with high orders of bounces (e.g. white fur). Enforcing 
energy conservation would require the network to output in a form 
that is analytically integrable, or integrates to a known value. The 
latter can be achieved with normalizing flows (as in [Müller et al. 
2020]) at an increased evaluation cost. Our BRDF model is currently 
not reciprocal, but reciprocity could be enforced with the modi- 
fied Rusinkiewicz encoding of directions [Zheng et al. 2021]. We 
opted for the Cartesian parameterization of directions that was more 
numerically stable in our experiments and yielded better visuals. 


Displacement. We do not currently support effects that affect 
surface geometry, such as displacement mapping. We implemented 
the neural displacement approach of Kuznetsov et al. [2021], and 
tested several variations that include geometric priors, but we found 
that this approach is always outperformed by fixed-function ray 
marching, both in terms of bandwidth and runtime. None of these 
approaches were sufficiently fast to reach our performance goals, 
but we expect additional research to make them viable alternatives. 


Filtering. Although neural prefiltering is effective at preventing 
aliasing, we report that, while the finest level is very accurate, the 
coarser levels of the latent pyramid tend to produce softer appear- 
ance than the supersampled reference BRDF. This is likely because 
the inputs to the encoder correlate strongly with the appearance only 
at the finest level. In case of coarser levels, the encoder consumes 
prefiltered material parameters, where the correlation is weaker
and the auto-encoder thus performs worse. Finetuning improves
the quality somewhat, but cannot escape the initial local minimum. 


Alternative geometric priors. We tested a number of alternative 
implementations of the rotation prior (Section 4.2), ranging from un- 
constrained, high-dimensional affine transforms inspired by the gen- 
erality of self-attention layers [Vaswani et al. 2017] to rotation-only 
matrices. Our final solution uses normalized (but not orthogonal) 
normal n and tangent t from the network output, with bitangent 
$b = (n \times t)/\|n \times t\|$. Additionally, we tested explicitly supervising
the extracted TBN frames against frames of the reference material, 
with an optional asymmetric loss [Vogels et al. 2018]. This occasion- 
ally improved the results (e.g., for glints), but the training requires 
extensive hyperparameter tuning; hence we excluded it from results. 
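
The final frame construction described in this paragraph can be sketched as follows. This is a minimal illustration only: it assumes the network emits two raw 3-vectors, and the names `shading_frame`, `n_raw`, and `t_raw` are hypothetical rather than taken from the authors' implementation.

```python
import math

def normalize(v):
    """Scale a 3-vector to unit length."""
    length = math.sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2])
    return (v[0] / length, v[1] / length, v[2] / length)

def cross(a, b):
    """Right-handed cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def shading_frame(n_raw, t_raw):
    """Build (t, b, n) from raw network outputs (hypothetical names).

    n and t are normalized individually but are NOT orthogonalized
    against each other; b = (n x t) / ||n x t|| is orthogonal to both.
    The construction is undefined when n and t are parallel.
    """
    n = normalize(n_raw)
    t = normalize(t_raw)
    b = normalize(cross(n, t))
    return t, b, n

# Example: a tangent that is deliberately not perpendicular to the normal.
t, b, n = shading_frame((0.0, 0.0, 2.0), (1.0, 0.0, 0.5))
```

In this example `b` is perpendicular to both `n` and `t`, while `n` and `t` themselves remain non-orthogonal, which is the property the text highlights.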


Training stability and time. We occasionally found training to 
converge to local minima with large visual differences based on 
small perturbations of hyperparameters or weight initialization. For 
instance, the smallest network configuration could not reliably
preserve the highly specular glazing of the TEAPOT so we chose to
include a version without it in our results (Figure 15). We want to 
investigate robustness more closely, also while scaling to a larger 
target material diversity. At the same time, we would like to sig- 
nificantly reduce training times (ideally from hours to minutes) to 
improve iteration times when developing further enhancements 
and to make the current iteration of the system more practical. 


Refraction. We evaluate our method only on purely reflective 
materials. Extending our model to transmissive materials poses the 
following challenge: physically based renderers require knowing 
the index of refraction of the material to maintain reciprocity after 
refracting. While the network could be trained to produce the index 
as an additional output, it is difficult to guarantee that this trained 
value matches the actual behavior of the BRDF; this topic deserves 
special attention in the future. 


10 CONCLUSION 


We present a complete real-time neural materials system. The model 
jointly addresses evaluation, sampling, and filtering of highly com- 
plex and detailed materials. We achieve this by combining ideas 
from prior works with new graphics priors and training strategies 
to achieve higher quality and faster training. A key contribution 
of our work is that such comprehensive solutions can be imple- 
mented efficiently on modern graphics hardware; we propose to 
deploy the neural network to the innermost rendering loop to reduce 
bandwidth requirements. In our tests, the neural BRDFs achieve 
state-of-the-art rendering performance, outperform optimized GPU 
implementations of reference multi-layered classical materials, and 
scale to multiple materials in a scene. We believe the presented 
neural BRDFs can serve as “baked” versions of complex materials; 
in addition to increasing performance and lowering memory consumption,
this enables easy interchange of arbitrarily complex materials be- 
tween different workflows and tools, simply by exchanging a fixed 
set of latent textures and a small table of MLP weights. Lastly, we 
hope this article will stimulate new investigations of using small 
neural networks in real-time for lighting, and geometry and volume 
rendering. 


A IMPORTANCE SAMPLING DETAILS


The following outlines the implementation details of our analytic 
proxy model used for importance sampling. 


Probability density. Like prior work [Fan et al. 2022; Sztrajman
et al. 2021] our sampling density is a simple linear blend between a
diffuse and specular term

$$p(\omega_o) = w_d \cdot p_d(\omega_o) + w_s \cdot p_s(\omega_o), \tag{5}$$

where $w_d + w_s = 1$. The diffuse PDF $p_d$ is a simple cosine-weighted
distribution but tilted by a normal vector computed from a predicted
2D surface slope $(\mu_{d,x}, \mu_{d,y})$ as

$$n_d = \mathrm{Normalize}([-\mu_{d,x}, -\mu_{d,y}, 1]). \tag{6}$$

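The specular PDF $p_s$ takes the form of a standard microfacet
density using a Trowbridge-Reitz NDF [Trowbridge and Reitz 1975;
Walter et al. 2007] with elliptical anisotropy and non-centered mean
surface slopes [Dupuy 2015]:

$$p_s(\omega_o) = D_{\mathrm{std}}\!\left(\frac{M^{-1}\omega_h}{\|M^{-1}\omega_h\|}\right) \cdot \frac{\det(M^{-1})}{\|M^{-1}\omega_h\|^{3}} \cdot \frac{1}{4\,|\omega_o \cdot \omega_h|}, \tag{7}$$

where $\omega_h = \mathrm{Normalize}(\omega_i + \omega_o)$ is the half vector and $D_{\mathrm{std}}$ is the
(isotropic) NDF with unit roughness ($\alpha = 1$), transformed based on

$$M = \begin{bmatrix} \alpha_x & 0 & -\mu_{s,x} \\ \alpha_y \rho & \alpha_y \sqrt{1-\rho^2} & -\mu_{s,y} \\ 0 & 0 & 1 \end{bmatrix}. \tag{8}$$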


Here, the elliptical anisotropy is described by two orthogonal roughness
values $\alpha_x, \alpha_y$ with correlation parameter $\rho$ and the mean of
the NDF is offset by a 2D surface slope $(\mu_{s,x}, \mu_{s,y})$.

The last two terms in Equation (7) are the Jacobian determinants
accounting for the transformation (and subsequent normalization)
of $\omega_h$, as well as the change of variables between $\omega_h$ and $\omega_o$.
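
As a rough illustration of how Equations (5) to (8) fit together, the sketch below evaluates the blended density in plain Python. It rests on several assumptions that are not stated in the text: $D_{\mathrm{std}}$ is treated as the solid-angle density of half vectors for the unit-roughness Trowbridge-Reitz distribution (the constant NDF $1/\pi$ times the cosine), the Jacobian of the normalized linear transform uses the cubed norm, the tilted diffuse lobe is a cosine lobe around $n_d$, and all function and parameter names are hypothetical.

```python
import math

def normalize(v):
    length = math.sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2])
    return (v[0] / length, v[1] / length, v[2] / length)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def diffuse_pdf(wo, mu_dx, mu_dy):
    """Cosine-weighted density around the tilted normal n_d of Eq. (6)."""
    n_d = normalize((-mu_dx, -mu_dy, 1.0))
    return max(dot(wo, n_d), 0.0) / math.pi

def apply_m_inverse(v, ax, ay, rho, mu_sx, mu_sy):
    """Solve M y = v for the matrix M of Eq. (8) by substitution.

    Requires ax > 0, ay > 0 and |rho| < 1 so that M is invertible.
    """
    s = math.sqrt(1.0 - rho * rho)
    z = v[2]
    x = (v[0] + mu_sx * z) / ax
    y = (v[1] + mu_sy * z - ay * rho * x) / (ay * s)
    return (x, y, z)

def specular_pdf(wi, wo, ax, ay, rho, mu_sx, mu_sy):
    """Eq. (7) under the assumptions listed in the lead-in."""
    h = (wi[0] + wo[0], wi[1] + wo[1], wi[2] + wo[2])
    if dot(h, h) == 0.0:
        return 0.0  # wo = -wi: the half vector is undefined
    wh = normalize(h)
    v = apply_m_inverse(wh, ax, ay, rho, mu_sx, mu_sy)
    length = math.sqrt(dot(v, v))
    cos_std = v[2] / length
    jac_reflect = 4.0 * abs(dot(wo, wh))
    if cos_std <= 0.0 or jac_reflect == 0.0:
        return 0.0
    # ASSUMPTION: unit-roughness Trowbridge-Reitz NDF is 1/pi, so the
    # half-vector density in the standard configuration is cos / pi.
    d_std = cos_std / math.pi
    det_m_inv = 1.0 / (ax * ay * math.sqrt(1.0 - rho * rho))
    return d_std * det_m_inv / length ** 3 / jac_reflect

def proxy_pdf(wi, wo, p):
    """Eq. (5): linear blend of the diffuse and specular densities."""
    return (p["w_d"] * diffuse_pdf(wo, p["mu_dx"], p["mu_dy"])
            + p["w_s"] * specular_pdf(wi, wo, p["ax"], p["ay"], p["rho"],
                                      p["mu_sx"], p["mu_sy"]))
```

With the identity transform ($\alpha_x = \alpha_y = 1$, $\rho = 0$, zero slopes) and $\omega_i$ along the surface normal, the specular term reduces to the constant $1/(4\pi)$, which is a convenient sanity check for an implementation.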


Sampling. The sample transform $W$ first selects one of the two
PDF terms (Equation (5)) based on the relative weights $w_d$ and $w_s$.
If the diffuse component is chosen we simply generate a cosine-
weighted outgoing direction $\omega_o$ and tilt it based on $n_d$. Otherwise,
we perform specular reflection along a sampled half-vector

$$\omega_h = \mathrm{Normalize}(M \cdot W_{\mathrm{std}}(u)), \tag{9}$$

where $W_{\mathrm{std}}$ is the usual isotropic NDF sampling technique ($\alpha = 1$).
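
The sampling procedure can be sketched in the same spirit. The code below is an illustration under stated assumptions, not the authors' shader code: the lobe is selected with a dedicated random number `u[0]`, the isotropic unit-roughness sampler is taken to be cosine-weighted hemisphere sampling (which follows from the Trowbridge-Reitz NDF being constant at $\alpha = 1$), the tilt is implemented as a change of basis to an arbitrary orthonormal frame around $n_d$, and all names are hypothetical.

```python
import math

def normalize(v):
    length = math.sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2])
    return (v[0] / length, v[1] / length, v[2] / length)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def sample_cosine_hemisphere(u1, u2):
    """Cosine-weighted direction about +z from two uniform numbers."""
    r = math.sqrt(u1)
    phi = 2.0 * math.pi * u2
    return (r * math.cos(phi), r * math.sin(phi),
            math.sqrt(max(0.0, 1.0 - u1)))

def tilt_to(v, n):
    """Express v (given around +z) in an orthonormal frame around n."""
    a = (1.0, 0.0, 0.0) if abs(n[0]) < 0.9 else (0.0, 1.0, 0.0)
    t = normalize(cross(a, n))
    b = cross(n, t)
    return tuple(v[0] * t[i] + v[1] * b[i] + v[2] * n[i] for i in range(3))

def apply_m(v, ax, ay, rho, mu_sx, mu_sy):
    """Multiply by the matrix M of Eq. (8)."""
    s = math.sqrt(1.0 - rho * rho)
    return (ax * v[0] - mu_sx * v[2],
            ay * rho * v[0] + ay * s * v[1] - mu_sy * v[2],
            v[2])

def sample_proxy(wi, u, p):
    """Draw an outgoing direction; u = (u0, u1, u2) uniform in [0, 1)."""
    if u[0] < p["w_d"]:
        # Diffuse lobe: cosine-weighted direction tilted toward n_d (Eq. 6).
        n_d = normalize((-p["mu_dx"], -p["mu_dy"], 1.0))
        return tilt_to(sample_cosine_hemisphere(u[1], u[2]), n_d)
    # Specular lobe: transform a standard half vector (Eq. 9), then reflect.
    h_std = sample_cosine_hemisphere(u[1], u[2])
    wh = normalize(apply_m(h_std, p["ax"], p["ay"], p["rho"],
                           p["mu_sx"], p["mu_sy"]))
    c = dot(wi, wh)
    return (2.0 * c * wh[0] - wi[0],
            2.0 * c * wh[1] - wi[1],
            2.0 * c * wh[2] - wi[2])
```

Reflected directions may end up below the macro surface; how such samples are treated is not specified in this appendix, so the sketch simply returns them unchanged.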


Network prediction. We dropped the explicit dependence of $p$ and
$W$ on $\omega_i$ and $x$ above for brevity, but our full set of 9 proxy
parameters $\{w_d, \mu_{d,x}, \mu_{d,y}, w_s, \alpha_x, \alpha_y, \rho, \mu_{s,x}, \mu_{s,y}\}$ are the result of an
MLP evaluation that takes these as input. To ensure that all inferred
parameters lie in their respective valid ranges ($\alpha \in [0, 1]$, $\rho \in
[-1, 1]$, $\mu \in [-\infty, +\infty]$) we append an appropriate final activation to
each network output based on quadratic approximations of tanh(x)
and sinh(x). Lastly, $w_d$ and $w_s$ are processed by the softmax function
to form valid mixing weights that add up to one.
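
To make the range constraints concrete, the following sketch maps nine unconstrained network outputs to proxy parameters. It is only an approximation of the scheme described above: the exact quadratic approximations of tanh and sinh are not given here, so the exact functions from Python's `math` module stand in for them, and the affine remapping of tanh to $[0, 1]$ for the roughness values is an assumption. The function name and the output ordering (taken from the parameter list above) are likewise illustrative.

```python
import math

# Order follows the parameter set listed in the text.
PARAM_NAMES = ("w_d", "mu_dx", "mu_dy", "w_s",
               "ax", "ay", "rho", "mu_sx", "mu_sy")

def decode_proxy_params(raw):
    """Map 9 raw MLP outputs to valid proxy parameters (hypothetical)."""
    r = dict(zip(PARAM_NAMES, raw))
    out = {}
    # Roughness in [0, 1]. ASSUMPTION: tanh remapped from [-1, 1].
    for key in ("ax", "ay"):
        out[key] = 0.5 * (math.tanh(r[key]) + 1.0)
    # Correlation in [-1, 1].
    out["rho"] = math.tanh(r["rho"])
    # Slopes are unbounded; sinh keeps the sign and grows quickly.
    for key in ("mu_dx", "mu_dy", "mu_sx", "mu_sy"):
        out[key] = math.sinh(r[key])
    # Two-way softmax so the mixing weights are positive and sum to one.
    m = max(r["w_d"], r["w_s"])
    e_d = math.exp(r["w_d"] - m)
    e_s = math.exp(r["w_s"] - m)
    out["w_d"] = e_d / (e_d + e_s)
    out["w_s"] = e_s / (e_d + e_s)
    return out

params = decode_proxy_params([0.0] * 9)
```

With all-zero raw outputs this yields equal mixing weights, mid-range roughness, zero correlation, and zero slopes. A practical implementation would likely also keep the roughness away from zero and the correlation away from ±1 so that the matrix in Equation (8) stays invertible.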


